Document Length Normalization
Author
Abstract
In the previous lecture we discussed pivoted document length normalization [Singhal et al. 96], a simple technique that applies a correction for the observation that document relevance correlates with document length. Through careful empirical verification of previous assumptions, they showed that the seemingly simple normalization term could have a big impact on results. However, in our discussion of the analysis that led to pivoted document length normalization, we passed over a basic question: How were the relevance judgments in the TREC dataset made on the approximately 740,000 documents and 50 queries?
Similar resources
CS 6740 : Advanced Language Technologies February 4 , 2010 Lecture 3 : Pivoted Document Length Normalization
In this lecture, we examine the impact of the length of a document on its relevance to queries. We show that document relevance is positively correlated with document length, and see that relevance scores that use the normalization techniques we’ve studied so far (L∞, L1, L2) do not capture this correlation correctly. Finally, we present the “pivoted document length normalization” technique int...
Document Normalization Revisited
Cosine Pivoted Document Length Normalization has reached a point of stability where many researchers indiscriminately apply a specific value of 0.2 regardless of the collection. Our efforts, however, demonstrate that applying this specific value without tuning for the document collection degrades average precision by as much as 20%.
Score Normalization Methods Applied to Topic Identification
Multi-label classification plays a key role in modern categorization systems. Its goal is to find the set of labels belonging to each data item. In multi-label document classification, unlike multi-class classification, where only the best topic is chosen, the classifier must decide whether a document does or does not belong to each topic in the predefined topic set. We are using the gene...
Information Space Gets Normal
Experiments are presented based on unofficial results for TREC-7. Eigensystems analysis of a term cooccurrence matrix is compared to eigensystems analysis of a term correlation matrix. For each matrix type, the effect of term weighting and document length normalization is assessed. Recall-precision curves and other TREC statistics indicate that the use of the correlation matrix improves perform...
Improving Term Frequency Normalization for Multi-topical Documents and Application to Language Modeling Approaches
Term frequency normalization is a serious issue because document lengths vary. Generally, documents become long for two different reasons: verbosity and multi-topicality. First, verbosity means that the same topic is repeatedly mentioned by terms related to the topic, so that term frequency is higher than in a well-summarized document. Second, multi-topicality indicates that a docu...
Publication date: 2010